Building a Morphosyntactic Lexicon and a Pre-syntactic Processing Chain for Polish
Abstract
This paper introduces a new set of tools and resources for Polish that cover all the steps required to transform raw, unrestricted text into reasonable input for a parser. These include (1) a large-coverage morphological lexicon, developed using the IPI PAN corpus and a lexical acquisition technique, and (2) multiple tools for spelling correction, segmentation, tokenization and named entity recognition. The processing chain can also handle the XCES format as both input and output, which makes it possible to improve XCES corpora such as the IPI PAN corpus itself. This allows us to give a brief qualitative evaluation of the lexicon and of the processing chain.
Similar resources
Extracting Semantic Classes and Morphosyntactic Features for English-Polish Machine Translation
This paper describes a procedure aimed at automatic extraction of certain noun and verb categories from Polish texts. The general goal is to construct a lexical database that should be incorporated into a system for machine translation and multilingual generation of summaries. High quality processing of inflectional languages like Polish requires quite elaborated lexical entries, it is therefor...
Syntactic Structure of Polish Proper Names of Places
The aim of this presentation is to introduce a syntactic analysis of Polish proper names of places. This study focuses on names of places in Warsaw, mainly places that have a postal address. We will introduce briefly the domain of our interest and give statistics of the collected data. Then, we will review types of syntactic structures that have the highest frequency in our corpus. The goal of ...
Ontology-Based Lexicon of Bulgarian
In contrast to morphological and syntactic processing, semantic annotation based on domain ontology is still underdeveloped for Bulgarian. On the other hand, the prerequisites for an ontological annotation are already available. These are as follows: a morphosyntactic tagger for Bulgarian with more than 95% accuracy; a dependency parser with more than 84% accuracy; a general chunker and a na...
Verbal Morphosyntactic Disambiguation through Topological Field Recognition in German-Language Law Texts
The morphosyntactic disambiguation of verbs is a crucial pre-processing step for the syntactic analysis of morphologically rich languages like German and domains with complex clause structures like law texts. This paper explores how much linguistically motivated rules can contribute to the task. It introduces an incremental system of verbal morphosyntactic disambiguation that exploits the conce...
Error Mining on Syntactic Parser Output
We introduce an error mining technique for automatically detecting errors in resources that are used in parsing systems. We applied this technique to parsing results produced on several million words by two distinct parsing systems, which share the syntactic lexicon and the pre-parsing processing chain. We are thus able to identify incorrectness and incompleteness sources in the resources. In p...